On the Use of Topological Features and Hierarchical Characteri- 
Abstract - Many features of complex systems can now be unveiled by applying statistical physics 
methods to treat them as social networks. The power of the analysis may be limited, however, 
by the presence of ambiguity in names, e.g. caused by homonymy in collaborative networks. In 
this paper we show that the ability to distinguish between homonymous authors is enhanced when 
longer-distance connections are considered, rather than looking at only the immediate neighbors of 
a node in the collaborative network. Optimized results were obtained upon using the 3rd hierarchy 
in connections. Furthermore, reasonable distinction among authors could also be achieved upon 
using pattern recognition strategies for the data generated from the topology of the collaborative 
network. These results were obtained with a network from papers in the arXiv repository, into 
which homonymy was deliberately introduced to test the methods with a controlled, reliable 
dataset. In all cases, several methods of supervised and unsupervised machine learning were 
used, leading to the same overall results. The suitability of using deeper hierarchies and network 
topology was confirmed with a real database of movie actors, with the additional finding that 
the distinguishing ability can be further enhanced by combining topology features and long-range 
connections in the collaborative network. 
Introduction. — The e-Science paradigm may be exploited to transform the tremendous amounts of
data electronically available into useful knowledge in varied fields. In science and technology,
for example, large databases include citation networks [1], journal databases [2], arXiv,
CiteSeer, DBLP, Web of Science and Google Scholar, whose analysis may assist in the
decision-making process of funding agencies and academic institutions. Citation networks, in
particular, have been studied with a variety of purposes, e.g. identifying the most relevant
papers in a survey and quantifying the impact of journals, conferences, researchers and
institutions. The applicability of these databases may be hampered, nevertheless, if they are not
accurate or if they contain ambiguities. For scientific databases, two major problems appear in
lists of authors of scientific articles: i) the same author may be referenced in different ways
and ii) distinct authors may have identical names, which is especially important for Chinese and
Korean researchers [3].

Several methods have been used to resolve ambiguities in authors' names in scientific papers, a
task akin to several other problems, such as matching [4] and duplicate detection [5]. These
methods are mostly based on text mining and on natural language processing [6], because
researchers are believed to be fairly characterized by their research field, so that textual
similarity measures are able to cluster together manuscripts authored by the same scientist. The
list of co-authors has also been used as a criterion for disambiguation [7, 8], since authors tend
to keep a specific collaboration group. Of lesser importance are the criteria based on the journal
name [9], language of the manuscript [9], authors' affiliation [9], self-citations [10] and source
URL metadata [10].



Another approach to detect and repair inconsistencies in databases is to represent them as
complex networks [11].



In this paper, we use concepts and metrics of networks to distinguish between authors represented
by the same alias in a collaborative network. The network was retrieved from arXiv, where
homonymy was deliberately introduced to have a reliable dataset, and distinction was made with two
approaches. In the first, we employed deeper local hierarchies [12] for analyzing the connectivity
of the collaborative network, while in the second topological features of the network were used.
The data generated from the analysis were treated with projection techniques [13, 14] to reduce
dimensionality, and pattern recognition methods were used in distinguishing authors. The two
methodologies, with deeper hierarchies and topological features, were combined to disambiguate
actors' names in the IMDb database.

Methodology. 

Databases. Two databases were used, the first of which is a set of preprint manuscripts from the
arXiv repository (see footnote 1). The articles were retrieved using the keywords "complex
network" or "scale free". The second database was retrieved from the IMDb repository (see
footnote 6). Only movies released after the year 2000 were considered. Details concerning both
databases are given in the Supplementary Information (SI).

Network Formation. Collaborative networks were generated using the two databases, in which the
nodes represented the authors or actors, being linked if they co-participated in a paper or
movie. The process of building the collaborative network of authors is illustrated in Figure 1a
for a small network with 7 fictitious papers (see caption), while Figure 1b shows the giant
component of the arXiv collaborative network for the subject of "complex networks". In the
fictitious papers shown in the figure, the aim is to disambiguate authors with the same name AA.
Note that it is necessary to represent the author of interest as different nodes in the network
because the disambiguation is performed at the paper level. Nevertheless, one does not need to
create a completely different network every time one wants to disambiguate a specific author. A
common network reflecting all collaborations in the database could be maintained, and only a few
nodes and edges would be added to or removed from it for the analysis of each new author. Thus
the disambiguation process would fundamentally depend on the number of papers co-authored by the
author under analysis.

6 http://www.arXiv.org

7 http://www.imdb.com

8 The Supplementary Information is available from https://dl.dropbox.com/u/2740286/eplSI9mai.pdf

The strength of the connection between vertices i and j is given by the weight $w_{ij}$:

$$w_{ij} = \sum_{p \in \Pi} w_{ij}^{p}, \tag{1}$$

where

$$w_{ij}^{p} = \begin{cases} 1/\|p\| & \text{if } i \text{ and } j \text{ appear in paper } p, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

$\Pi$ represents the set of all papers in the database and $\|p\|$ is the number of authors of a
given paper p. The weight was divided by $\|p\|$ to take into account the finding that
relationships among few authors are usually stronger than those involving several authors [15].
The weight of the links is not shown in Figure 1a, but its computation is straightforward. For
instance, the weight for the link between AB and AC is 1/3, while that for AE and AF is 1 (1/2
from paper 6 plus 1/2 from paper 7).
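The weighting rule of Eqs. (1) and (2) can be checked on the seven fictitious papers of Fig. 1 with a minimal Python sketch. The function name `link_weights` and the data layout are illustrative assumptions, not part of the original study; for simplicity the homonymous name AA is kept as a single node here, whereas the paper represents each occurrence as a separate node.

```python
from collections import defaultdict
from itertools import combinations

# Authorship of the 7 fictitious papers listed in the caption of Fig. 1.
PAPERS = {
    1: ["AA", "AB", "AC"],
    2: ["AA", "AG", "AD"],
    3: ["AA", "AF", "AG", "AI"],
    4: ["AA", "AG"],
    5: ["AC", "AD"],
    6: ["AE", "AF"],
    7: ["AF", "AE"],
}


def link_weights(papers):
    """Return {frozenset({i, j}): w_ij}, adding 1/||p|| for every shared paper p."""
    weights = defaultdict(float)
    for authors in papers.values():
        unique = sorted(set(authors))
        for i, j in combinations(unique, 2):
            weights[frozenset((i, j))] += 1.0 / len(unique)
    return dict(weights)


W = link_weights(PAPERS)
print(W[frozenset(("AB", "AC"))])  # one shared paper with three authors
print(W[frozenset(("AE", "AF"))])  # two shared papers with two authors each
```

The two printed values correspond to the worked cases in the text: 1/3 for the AB-AC link and 1 for the AE-AF link.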

Characterization of Entities Through Connectivity Analysis. In the strategy based on
co-authorship, each occurrence of an ambiguous entity is characterized by relations of
co-participation in the same paper/movie. Let e be an ambiguous entity and let $v_e$ be the vector
describing the co-authorship features of e. Each element i in $v_e$ represents one of the possible
entities in the database. As such, if i and e appear in the same document, then $v_e(i) = 1$.
Otherwise, $v_e(i) = 0$. In order to reduce the complexity of the problem, two techniques to
reduce dimensionality were used: principal component analysis [14] (PCA) and latent semantic
analysis [13] (LSA). Because LSA performed better than PCA in the experiments, all of the results
reported here were obtained with LSA. Both techniques are described in the SI.
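As an illustration of the binary co-authorship vectors described above, the following Python sketch builds one vector per occurrence of the ambiguous name AA in the fictitious papers of Fig. 1. The helper `coauthorship_vectors` is hypothetical, and leaving the ambiguous entity itself out of the feature list is an assumption made here for readability. The dimensionality reduction step (PCA or LSA) is not shown.

```python
# Authorship of the 7 fictitious papers listed in the caption of Fig. 1.
PAPERS = {
    1: ["AA", "AB", "AC"],
    2: ["AA", "AG", "AD"],
    3: ["AA", "AF", "AG", "AI"],
    4: ["AA", "AG"],
    5: ["AC", "AD"],
    6: ["AE", "AF"],
    7: ["AF", "AE"],
}


def coauthorship_vectors(papers, ambiguous):
    """Build one binary vector per paper in which the ambiguous name occurs.

    Element i of a vector is 1 if entity i co-occurs with the ambiguous name
    in that paper and 0 otherwise.
    """
    entities = sorted({a for authors in papers.values() for a in authors} - {ambiguous})
    vectors = {}
    for paper_id, authors in papers.items():
        if ambiguous in authors:
            present = set(authors)
            vectors[paper_id] = [1 if e in present else 0 for e in entities]
    return entities, vectors


entities, vectors = coauthorship_vectors(PAPERS, "AA")
for paper_id, vector in sorted(vectors.items()):
    print(paper_id, dict(zip(entities, vector)))
```

Each row of the resulting matrix describes one occurrence of the name, which is the unit that the classifiers later have to assign to a distinct author.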

Characterization of Entities Through Topological Analysis. In addition to the strategy based on
co-authors, we evaluated the suitability of the local topological structure for disambiguating
names. The measurements used were: degree k, which quantifies the number of links; strength s,
which quantifies the sum of the weights of links; clustering coefficient C, which measures the
density of links around the node of interest; average degree $\langle k_n \rangle$ and strength
$\langle s_n \rangle$ of immediate neighbors; and the standard deviations $\sigma_{k_n}$ and
$\sigma_{s_n}$ of degree and strength of neighbors, respectively. Further details on complex
networks measurements are given in Refs. [11, 16].
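The seven local measurements listed above can be computed for a single node as in the Python sketch below. The toy edge list, the function names and the dictionary keys are illustrative assumptions. The clustering coefficient is taken in its usual unweighted form, and the population standard deviation is used; the source does not state which variants were adopted.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical weighted, undirected toy network given as (node, node, weight).
EDGES = [("A", "B", 1.0), ("A", "C", 0.5), ("B", "C", 0.5), ("C", "D", 1.0)]


def build_graph(edges):
    """Return an adjacency map {node: {neighbor: weight}} for an undirected graph."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def topological_features(graph, node):
    """Compute k, s, C and the neighbor statistics for one node."""
    neighbors = list(graph[node])
    k = len(neighbors)
    s = sum(graph[node].values())
    # Fraction of neighbor pairs that are themselves linked.
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    c = 2.0 * links / (k * (k - 1)) if k > 1 else 0.0
    degrees = [len(graph[n]) for n in neighbors]
    strengths = [sum(graph[n].values()) for n in neighbors]
    return {
        "k": k,
        "s": s,
        "C": c,
        "k_n_mean": mean(degrees) if degrees else 0.0,
        "s_n_mean": mean(strengths) if strengths else 0.0,
        "k_n_std": pstdev(degrees) if degrees else 0.0,
        "s_n_std": pstdev(strengths) if strengths else 0.0,
    }


G = build_graph(EDGES)
features = topological_features(G, "A")
print(features)
```

In the disambiguation setting, one such feature vector would be computed for every node that represents an occurrence of the ambiguous name.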

Fig. 1: (a) Example of a collaborative network built from a fictitious list of authors and (b)
giant component of the collaborative network of a fraction of the arXiv repository. AX stands for
Author X. In (a), the following authorship of papers was considered: paper 1 (AA, AB and AC),
paper 2 (AA, AG and AD), paper 3 (AA, AF, AG and AI), paper 4 (AA and AG), paper 5 (AC and AD),
paper 6 (AE and AF) and paper 7 (AF and AE).

Hierarchical Characterization. Inspired by studies showing that the expansion of local analysis
to further neighbors allows better characterization of networks [17], we introduced the
hierarchical analysis in the characterization of collaborative networks. When the hierarchy of a
given node is expanded, all of its neighbors $v_n$ are lumped into a single new node $v_h$. As a
result, if any other node $v_i$ of the network was connected to $v_n$ before the expansion, then
afterward $v_i$ will be connected to $v_h$. In our experiments, the networks were expanded twice;
therefore, three hierarchies were generated. Details on the hierarchical characterization in
complex networks are given in
Refs. ~ 
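One way to read the expansion procedure is sketched below in Python: at each step the current group of nodes is merged with all of its neighbors, and the nodes linked to the merged group form the next hierarchy. Merging the node of interest together with its neighbors into the super-node is an assumption of this sketch, as are the function names and the toy path graph.

```python
# Hypothetical unweighted toy network (a path A-B-C-D-E) as {node: set of neighbors}.
GRAPH = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}


def expand(graph, group):
    """Lump a group of nodes and all of its neighbors into one super-node."""
    merged = set(group)
    for node in group:
        merged.update(graph[node])
    return frozenset(merged)


def hierarchical_neighbors(graph, node, levels=3):
    """Return the neighbor sets seen at hierarchies 1..levels of a node.

    Hierarchy 1 is the immediate neighborhood; each further hierarchy is the
    neighborhood of the super-node obtained after one more expansion.
    """
    group = frozenset([node])
    rings = []
    for _ in range(levels):
        outside = {n for member in group for n in graph[member]} - group
        rings.append(outside)
        group = expand(graph, group)
    return rings


rings = hierarchical_neighbors(GRAPH, "A", levels=3)
for level, ring in enumerate(rings, start=1):
    print(level, sorted(ring))
```

With two expansions, as in the experiments described above, the loop yields three neighbor sets, one per hierarchy.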
Pattern Recognition Techniques. Pattern recognition techniques that induce classifiers from the
training set were used in the disambiguation task, employing features extracted from the analysis
of connectivity and topology of the collaborative networks. The quality of the results was then
evaluated using the 10-fold cross-validation technique, which was chosen because it is robust:
the training set is always different from the evaluation set. It thus prevents overfitted
inductors from attaining high accuracy rates. We used methods belonging to two paradigms, namely
supervised and unsupervised techniques. In the former, a function is inferred from the labeled
training data. The four techniques used were: the C4.5 algorithm, which generates trees based on
the gain provided by each feature; the Naive Bayes algorithm [19], which uses Bayes' theorem; the
k nearest neighbor algorithm [19] (kNN), which classifies an external unknown instance according
to the most similar instance of the training database in a normalized space including all
features; and the RIPPER algorithm, which generates a set of explicit rules to classify new
instances. In the unsupervised methods, one does not know in advance which element belongs to
each class; what is known is that a given pair of names belongs to the same entity. The
techniques used were: k-means [19], Expectation Maximization (EM) [19], single linkage [20],
complete linkage [20], average linkage [20] and Ward's linkage [20]. After the classification
phase, two quality indicators were employed to assess the performance: the rate of instances
correctly classified and the f-measure [21], which represents a balance between precision and
recall of correctly classified instances. The algorithms, the cross-validation technique and the
f-measure are described in the SI.

The reasons why several supervised and unsupervised 
machine learning methods were used are related to ensur- 
ing robustness of the data analysis, especially because we 
shall show that the overall conclusions are independent of 
the pattern recognition method. 
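The evaluation protocol can be illustrated with a short Python sketch that combines k-fold cross-validation, a nearest neighbor classifier with k = 1, and the f-measure. Everything here is an assumption for illustration: the synthetic one-dimensional data, the unshuffled interleaved folds, the squared Euclidean distance and the per-class f-measure. The actual definitions used in the study are those given in the SI.

```python
def f_measure(true_labels, predicted, positive):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def k_fold_indices(n, k):
    """Split range(n) into k disjoint, interleaved test folds (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


def one_nn(train_x, train_y, x):
    """Label x with the class of its closest training instance."""
    distances = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[distances.index(min(distances))]


def cross_validate(xs, ys, k=10):
    """Predict every instance using only the folds it does not belong to."""
    predicted = [None] * len(xs)
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for i in fold:
            predicted[i] = one_nn(train_x, train_y, xs[i])
    return predicted


# Synthetic data: two well-separated homonymous "authors" described by one feature.
xs = [(i * 0.1,) for i in range(10)] + [(10.0 + i * 0.1,) for i in range(10)]
ys = ["a"] * 10 + ["b"] * 10

predicted = cross_validate(xs, ys, k=10)
accuracy = sum(1 for t, p in zip(ys, predicted) if t == p) / len(ys)
print(accuracy, f_measure(ys, predicted, "a"))
```

Because every instance is predicted by a classifier that never saw it during training, a model that merely memorizes the training set gains no advantage in the reported scores.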

Results and Discussion. 

Disambiguation based on the connectivity of authors. In this strategy we used a set of N = 1,842
features, where N is the number of authors in the arXiv database. Because the data including the
N features for the various homonymous authors had a high dimension, we employed PCA and LSA to
reduce the dimension. Then we used the pattern recognition strategies mentioned in the
methodology. The analysis of the immediate neighborhood of authors in the collaborative network
allows for distinction of homonymous authors, with the overall accuracy increasing when deeper
hierarchies were used. Figure 2 shows the f-measure obtained with the 3rd hierarchy for the arXiv
network, which indicates that the performance decreased with the number of homonymous authors, as
expected, and this applies to all algorithms tested (see also Figure S1 of the SI). The superior
performance of the analysis considering the 2nd and 3rd hierarchies is depicted in Figure 3,
which shows the percentage of cases in which each hierarchy achieved the best performance (a
statistical analysis of the figure is provided in Table S1 of the SI). These results are
consistent with the finding in a previous study where the use of higher hierarchies improved the
local characterization of networks [17]. Hierarchies higher than 3 were not attempted owing to
the high computational cost. Nevertheless, the performance is unlikely to increase considerably
for deeper hierarchies, and should indeed be expected to decrease if very high hierarchies were
used because more information might be lost than gained.

Fig. 2: f-measure for the disambiguation task from the analysis of connectivity of the
collaborative network using the 3rd hierarchy. The algorithms used were (a) C4.5 and (b) kNN-1.
In all cases, the ability to distinguish among authors decreased as the number of ambiguous
entities increased.

The disambiguation process in collaborative networks 
may depend on the edge density of the collaborative net- 
works. If everybody is connected to everybody else, then 
the disambiguation process tends to deteriorate and other 
factors in addition to co-authorship relations should be 
included to discriminate authors' names. Fortunately, in 
practice, collaborative networks are organized in commu- 
nities so that the clustering is high only within commu- 
nities. As such, the high clustering within communities 
is desirable when ambiguous authors belong to distinct 
communities. The higher the clustering the smaller the 
number of external links will be. As a result, authors' co- 
authorship patterns will be quite distinct provided they 
belong to different communities. 

Topological features used in distinguishing authors. To our knowledge this is the first attempt
to use the topology of collaborative networks for disambiguating authors. We used a set of 7
topological measurements described in the methodology, but in principle other local measurements
could have also been employed. The results in Figure 4 show that the overall discrimination
ability using the network topology is worse than that obtained with the analysis of connectivity
(see Figure 2). However, the discrimination based on topological features was found to be
statistically significant, as depicted in Table S2 of the SI, which points to authors exhibiting
particular patterns of connectivity in collaborative networks.

Fig. 3: Percentage of cases where each hierarchy achieves the highest value of the f-measure for
the following algorithms: (a) complete linkage; (b) Ward's linkage; (c) k-means; (d) Expectation
Maximization. Note that, in most cases, the 3rd hierarchy outperforms the 2nd and 1st
hierarchies. A similar behavior was observed for the other algorithms.

Fig. 4: f-measure for the disambiguation task based on the topological approach. The algorithms
used were (a) C4.5 and (b) kNN-1. In all cases, the ability to discriminate authors decreased as
the number of ambiguous entities increased (Figure S2 of the SI brings the analogous curves for
the other algorithms).

We also investigated which topological features were most efficient for discriminating authors,
ranking them according to two criteria. The first criterion is based on the information gain
achieved for each measurement and the second one is based on a methodology analogous to the
Mann-Whitney U test [22] (see SI for details regarding both methods). While the former has the
advantage that the ranking is algorithm independent, it might overlook interactions between
features because they are evaluated individually. For this reason, we also devised a methodology
that does not ignore interactions between features. More specifically, the Mann-Whitney U test
sorts the classifiers according to their accuracy rate. Then, the relevance of attributes is
assigned according to their frequency in the top classifiers (see footnote 9).
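The first ranking criterion can be illustrated with a small Python sketch that scores each feature individually by its information gain. The binary split at a fixed threshold, the toy values and the function names are assumptions made here; the procedure actually used is the one described in the SI.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting one feature at a threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical values of two measurements for four occurrences of a name
# that belongs to two distinct authors, "x" and "y".
labels = ["x", "x", "y", "y"]
candidates = {
    "k_n_mean": ([2.0, 2.5, 6.0, 7.0], 4.0),   # (values, split threshold)
    "C": ([0.5, 0.9, 0.5, 0.9], 0.7),
}

gains = {name: information_gain(values, labels, threshold)
         for name, (values, threshold) in candidates.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```

Since each feature is scored on its own, a ranking of this kind cannot reward features that are only informative in combination, which is the limitation noted in the text.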

The most efficient measurements for discrimination were $\langle k_n \rangle$ and
$\langle s_n \rangle$, as shown in the ranking in Table 1, which gives the percentage of cases in
which these measurements appeared as the best feature in the most efficient algorithm (kNN). The
table also shows the corresponding p-value, considering a random ranking of measurements as null
model, for both ranking criteria. Significantly, the clustering coefficient C did not appear
among the most important features for distinction. These results may be interpreted as follows.
Discrimination appears to be governed by the average number of co-authors of neighbors, and to a
lesser extent by the strength of such connections. Therefore, the most relevant information is
actually the number of connections with external authors, i.e., the number of connections with
authors who have never co-authored a paper with the author represented by the node under
analysis. This means that the structure of the neighbors allows a better characterization than
the local structure of the node itself, consistent with the findings from the analysis of
connectivity.

Connectivity and topology combined in a real network.

The strategies based on the connectivity (with 3rd hierarchy) and topological features were
combined in the disambiguation task for a real database derived from the IMDb database for actors
in movies (details are given in the SI). Table 2 shows the accuracy rate achieved for each actor
and the corresponding p-value (assuming as null model a random disambiguation system). For each
actor, the scores shown were obtained with topological (TP) features, connectivity (CN) and with
the combination of both strategies (CN + TP). For Attila and Matt Hughes, TP-features alone
performed worse than CN-features, but the best result was reached with the strategies combined.
Likewise, for Igor and Justin Long, the combination generated the best disambiguation. For Bill
Bailey, surprisingly, TP-features alone yielded the best results. For Steve Austin, the
combination also yielded the best result, although the same quality had already been obtained
with CN-features. Finally, for Christian, even the random disambiguation system was already very
efficient, and therefore a comparison is void. The results in this study are analogous to the
findings in the task of recognizing authorship in written texts [23], in which the topology was
proven useful for revealing patterns related to writing style.

Table 2: Accuracy rate and p-value obtained for the disambiguation based on topological (TP) and
connectivity (CN) measurements. The combination (CN + TP) of features was also examined. For all
actors, the topological features appear in the best classifiers.

Actor          Classif.   Acc.     p-value       Features
Attila         C4.5       64.3 %   1.0 x 10^-1   TP
               kNN        78.6 %   2.8 x 10^-2   CN
               kNN        85.7 %   6.0 x 10^-3   CN + TP
Matt Hughes    kNN        77.4 %   5.8 x 10^-1   TP
               kNN        80.7 %   8.5 x 10^-2   CN
               kNN        83.9 %   3.6 x 10^-2   CN + TP
Igor           RIP.       81.8 %   5.5 x 10^-2   TP
               kNN        81.8 %   5.5 x 10^-2   CN
               kNN        86.4 %   1.8 x 10^-2   CN + TP
Justin Long    kNN        95.8 %   2.2 x 10^-4   TP
               kNN        93.6 %   3.0 x 10^-2   CN
               RIP.       95.8 %   2.2 x 10^-4   CN + TP
Bill Bailey    C4.5       96.0 %   2.2 x 10^-2   TP
               kNN        87.8 %   6.1 x 10^-1   CN
               kNN        87.8 %   6.1 x 10^-1   CN + TP
Steve Austin   Bayes      84.8 %   1.0 x 10^-1   TP
               Bayes      88.9 %   2.2 x 10^-2   CN
               kNN        88.9 %   2.2 x 10^-2   CN + TP
Christian      kNN        97.1 %   4.5 x 10^-1   TP
               kNN        97.6 %   2.9 x 10^-1   CN
               kNN        97.6 %   2.9 x 10^-1   CN + TP

Conclusion. — Two innovative approaches were introduced in this paper for disambiguating names in
collaborative networks. In the first, we extended the traditional method based on the
connectivity with immediate neighbors in networks by incorporating the analysis of higher
hierarchies. We showed that the 3rd hierarchy leads to a considerably improved performance in the
disambiguation task. In the second approach, we used, for the first time to our knowledge, the
topology of networks for disambiguating names. The two most efficient measurements for
distinguishing authors were $\langle k_n \rangle$ and $\langle s_n \rangle$, i.e. distinction
depends mainly on the connectivity of the neighbors. This reinforces the importance of
considering deeper hierarchies while analyzing collaborative networks [17].

All of these results were obtained for a subnetwork from the arXiv repository for the area of
complex networks, in which ambiguity was deliberately introduced. This artificial system was
chosen to ensure a reliable dataset and the statistical significance of our analysis.
Furthermore, the robustness of the analysis was ensured by employing various pattern recognition
methods, belonging to both supervised and unsupervised machine learning paradigms, with which
similar results were obtained. The approaches can be extended to real networks, as we showed for
a network of movie actors. In particular, we noted that the combination of the two approaches
leads to improved performance in disambiguating names, which is promising for further
applications requiring removal of ambiguity in databases.

In future work, we intend to investigate whether the hierarchical characterization introduced in
the traditional analysis can further improve the ability of discrimination in the topological
characterization. Also, we intend to analyze the performance of similarity measures based on
complex networks, such as the Katz similarity. Another point of future investigation concerns the
verification of the precise influence of sampling on the topological analysis, because incomplete
databases usually generate worse disambiguating systems (see SI). Finally, we plan to apply the
topological approach to the problem of disambiguating words in written texts (word sense
disambiguation) [24].

9 Even though the Mann-Whitney U test relies on the pattern recognition strategy to perform the
ranking, in our experiments all algorithms displayed practically the same results.

Table 1: % of cases in which $\langle k_n \rangle$ and $\langle s_n \rangle$ appeared in the first
position of the ranking performed using the Information Gain Criterion and the Mann-Whitney test
[22] with the kNN algorithm. The p-value corresponds to the likelihood of the corresponding
percentage to be obtained considering as null model a random ranking of measurements. #N is the
number of ambiguous names for an author.

         Information Gain Criterion                          Mann-Whitney criterion
         $\langle k_n \rangle$     $\langle s_n \rangle$     $\langle k_n \rangle$     $\langle s_n \rangle$
#N       %        p-value          %        p-value          %        p-value          %        p-value
Two      56.8 %   < 1.0 x 10^-15   31.1 %   9.6 x 10^-5      45.2 %   3.8 x 10^-14     50.0 %   < 1.0 x 10^-15
Three    67.4 %   < 1.0 x 10^-15   23.1 %   3.9 x 10^-2      70.0 %   < 1.0 x 10^-15   29.5 %   1.2 x 10^-3
Four     73.2 %   < 1.0 x 10^-15   20.6 %   7.0 x 10^-2      73.7 %   < 1.0 x 10^-15   26.3 %   2.1 x 10^-2
Five     73.2 %   < 1.0 x 10^-15   23.1 %   3.9 x 10^-2      81.6 %   < 1.0 x 10^-15   18.4 %   7.3 x 10^-1
Six      75.8 %   < 1.0 x 10^-15   20.6 %   7.0 x 10^-2      91.1 %   < 1.0 x 10^-15   8.9 %    ~ 1.0
Seven    71.1 %   < 1.0 x 10^-15   25.8 %   1.0 x 10^-2      86.3 %   < 1.0 x 10^-15   13.2 %   ~ 1.0
Eight    66.3 %   < 1.0 x 10^-15   30.0 %   3.0 x 10^-4      96.8 %   < 1.0 x 10^-15   3.2 %    ~ 1.0
Nine     69.4 %   < 1.0 x 10^-15   26.3 %   7.2 x 10^-3      82.6 %   < 1.0 x 10^-15   17.4 %   8.4 x 10^-1
Ten      42.6 %   7.7 x 10^-13     52.6 %   < 1.0 x 10^-15   92.1 %   < 1.0 x 10^-15   7.9 %    ~ 1.0